Take Home Exercise 3: Segmenting Starbucks Drinks using Visual Analytics

This is the third take home exercise in the series of take home exercises for the module Visual Analytics. This exercise uses multivariate analysis techniques to segment Starbucks drinks in the category ‘Kids drinks and others’ based on Nutrition factors given in the data set starbucks_drinks.csv.

2022-02-20

R packages used in this exercise

The following R packages are used in this exercise: knitr, rmarkdown, corrplot, ggstatsplot, tidyverse, heatmaply, seriation, dendextend, GGally, parcoords and parallelPlot.

They are loaded using the code chunk given below. The code chunk stores the list of packages in a vector and iterates over it with a for loop, loading each package in turn. The if condition installs a package first if it is not already installed. Overall, this pattern makes it easy to work with any number of packages in R.

packages = c( 'knitr','rmarkdown','corrplot', 'ggstatsplot','tidyverse','heatmaply','seriation', 'dendextend','GGally', 'parcoords', 'parallelPlot')
for(p in packages)
{
  if(!require(p,character.only = T))
  {
    install.packages(p)
  }
  library(p,character.only = T)
}

1. Data Preparation

The given task: Create a data visualization to segment kid drinks and other by nutrition indicators. For the purpose of this task, starbucks_drink.csv should be used.

For the purpose of this task, the data set was downloaded from the course e-learn platform and saved to the exercise folder in the course project repository. The following code chunk was used to load the data set into R.

starbucks_drinks <- read_csv("data/starbucks_drink.csv")

In the above code chunk, the read_csv() function from the readr package (loaded as part of tidyverse) is used, since the data set is in .csv format. The resulting data frame is named “starbucks_drinks”.

Let’s have a look at the data set.

paged_table(head(starbucks_drinks))

If a data frame has a large number of rows it might not be practical to display it fully inline1. Here, the rmarkdown::paged_table() function, which allows pagination of the rows and columns of a data set, was used to display the first 6 rows (head) of the data set inline. This lets users browse the table view of a data set interactively.

In the data set view, the data type of each field is displayed under its name. It can be seen that the data type of the “Caffeine(mg)” field is character, which is incorrect as it contains numeric values.

The as.numeric() function is used to change the data type of the “Caffeine(mg)” field to numeric, and the class() function is used to confirm that the data type of the “Caffeine(mg)” field is now “numeric”.

starbucks_drinks["Caffeine(mg)"] <- as.numeric(starbucks_drinks$`Caffeine(mg)`)
class(starbucks_drinks$`Caffeine(mg)`)
[1] "numeric"
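As a side note, as.numeric() silently coerces strings it cannot parse to NA, so it can be worth counting how many NAs a conversion introduces. A toy illustration (hypothetical values, not the exercise data):

```r
# Toy character vector standing in for a messy column; "varies" cannot be parsed
x <- c("10", "25", "varies", "0")
num <- suppressWarnings(as.numeric(x))

# Count values that became NA purely because of the coercion
sum(is.na(num) & !is.na(x))   # 1
```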

Now, for the purpose of the given task, the starbucks_drinks data set needs to be filtered on the “Category” field for the value “kids-drinks-and-other”. The filter() function is used as follows to accomplish this step.

kids_drinks <- starbucks_drinks %>%
  filter(Category == "kids-drinks-and-other" )

The above paged table view also indicates that there are a few missing values in our data set. Let us have a look at the missing values.

colSums(is.na(kids_drinks))
             Category                  Name        Portion(fl oz) 
                    0                     0                     0 
             Calories     Calories from fat          Total Fat(g) 
                    0                     0                     0 
     Saturated fat(g)          Trans fat(g)       Cholesterol(mg) 
                    0                     0                     0 
           Sodium(mg) Total Carbohydrate(g)      Dietary Fiber(g) 
                    0                     0                     0 
            Sugars(g)            Protein(g)          Caffeine(mg) 
                    0                     0                     0 
                 Size                  Milk         Whipped Cream 
                    0                     5                     5 

The colSums() function gives the sum of values per column, and the is.na() function returns Boolean values indicating whether each value in the data frame is missing. The output shows that the fields “Milk” and “Whipped Cream” have 5 missing values each. To decide how to deal with the missing values, let us first explore the unique values in these fields.

#unique values in the Milk field
unique(kids_drinks$Milk)
[1] "Almond"              "Coconut"             "Nonfat milk"        
[4] "Whole Milk"          "2% Milk"             "Soy (United States)"
[7] NA                   
#unique values in the Whipped Cream field
unique(kids_drinks$`Whipped Cream`)
[1] "No Whipped Cream" "Whipped Cream"    NA                

Intuitively, we can fill the missing values with “No Milk” and “No Whipped Cream” in the fields “Milk” and “Whipped Cream” respectively.

kids_drinks$Milk[is.na(kids_drinks$Milk)] <- 'No Milk'
kids_drinks$`Whipped Cream`[is.na(kids_drinks$`Whipped Cream`)] <- 'No Whipped Cream'

Let us look at the summary of the kids_drinks data set to find out any discrepancies or outliers in the data fields.

summary(kids_drinks[,3:15])
 Portion(fl oz)     Calories     Calories from fat  Total Fat(g)   
 Min.   : 8.00   Min.   : 90.0   Min.   :  0.00    Min.   : 0.000  
 1st Qu.: 8.00   1st Qu.:192.5   1st Qu.: 50.00    1st Qu.: 6.000  
 Median :12.00   Median :270.0   Median : 80.00    Median : 9.000  
 Mean   :13.51   Mean   :283.8   Mean   : 83.24    Mean   : 9.177  
 3rd Qu.:16.00   3rd Qu.:350.0   3rd Qu.:110.00    3rd Qu.:12.750  
 Max.   :24.00   Max.   :650.0   Max.   :220.00    Max.   :24.000  
 Saturated fat(g)  Trans fat(g)     Cholesterol(mg)   Sodium(mg)   
 Min.   : 0.000   Min.   :0.00000   Min.   : 0.00   Min.   : 10.0  
 1st Qu.: 2.000   1st Qu.:0.00000   1st Qu.: 0.00   1st Qu.:115.0  
 Median : 4.750   Median :0.00000   Median :20.00   Median :160.0  
 Mean   : 5.149   Mean   :0.03053   Mean   :21.49   Mean   :172.4  
 3rd Qu.: 7.000   3rd Qu.:0.00000   3rd Qu.:30.00   3rd Qu.:210.0  
 Max.   :15.000   Max.   :0.50000   Max.   :75.00   Max.   :460.0  
 Total Carbohydrate(g) Dietary Fiber(g)   Sugars(g)    
 Min.   :14.00         Min.   :0.00     Min.   :12.00  
 1st Qu.:28.00         1st Qu.:0.00     1st Qu.:25.00  
 Median :39.50         Median :1.00     Median :37.00  
 Mean   :42.47         Mean   :1.79     Mean   :38.67  
 3rd Qu.:53.00         3rd Qu.:3.00     3rd Qu.:48.75  
 Max.   :99.00         Max.   :7.00     Max.   :85.00  
   Protein(g)      Caffeine(mg)   
 Min.   : 0.000   Min.   :  0.00  
 1st Qu.: 4.000   1st Qu.:  0.00  
 Median : 7.000   Median :  0.00  
 Mean   : 8.237   Mean   : 10.25  
 3rd Qu.:12.000   3rd Qu.: 20.00  
 Max.   :19.000   Max.   :225.00  

The summary() function gives us the minimum, maximum, mean, median, 1st quartile and 3rd quartile values for each of the fields. The field “Caffeine(mg)” seems to have outliers as the maximum value is significantly greater than the 3rd quartile. Let us plot a boxplot for the field to confirm if there are outliers.

ggplot(kids_drinks, aes(y=`Caffeine(mg)`)) + 
  geom_boxplot(fill="gray") +
  scale_color_grey() + 
  labs(title="Box plot of Caffeine(mg)", subtitle = "Caffeine(mg) has an outlier") + 
  theme_classic()

The box plot confirms that there is an outlier with value over 200 mg, while the rest of the data points have values less than 50 mg. The outlier was filtered away using the following piece of code.

kids_drinks <- kids_drinks %>%
  filter(`Caffeine(mg)` <200)

Let us also check if the data set contain duplicate records.

sum(duplicated(kids_drinks))
[1] 0

The duplicated() function returns a logical vector flagging rows that repeat an earlier row, and the sum() function counts the TRUE values. There are no duplicate records in the data set.
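A toy illustration of how this count works (hypothetical data, not the exercise set):

```r
# duplicated() flags rows that repeat an earlier row; sum() counts the TRUEs
df <- data.frame(a = c(1, 1, 2), b = c("x", "x", "y"))
duplicated(df)        # FALSE  TRUE FALSE
sum(duplicated(df))   # 1
```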

Moving on to segmentation of the kids drinks and other category based on nutritional factors, we can see that the fields “Name”, “Milk”, “Whipped Cream” and “Size”/“Portion(fl oz)” of a drink affect the values of the nutritional factors. The field “Size” corresponds to “Portion(fl oz)”. We need to combine these fields to get the drink type for this category. Let us look at the correlation plot of the numerical fields in the data to see whether the field “Portion(fl oz)” is correlated with the rest of them. A correlation analysis is done using the ggcorrmat() function of the ggstatsplot package at a confidence level of 95%.

ggstatsplot::ggcorrmat(
  data = kids_drinks, 
  cor.vars = 3:15,
  ggcorrplot.args = list(outline.color = "black", 
                         hc.order = TRUE,
                         tl.cex = 14),
  title    = "Correlogram for Starbucks dataset",
  subtitle = "7 pairs are not significantly correlated at p < 0.05",
   colors = c("#CC6600", "white", "#000066"), outline.color = "black",
  ggtheme = theme_minimal()
)

The above correlation plot shows that the field “Portion(fl oz)” is not correlated with many of the nutritional factors, which means the values of the nutritional factors do not change linearly with the portion of the drink; therefore it cannot be used to aggregate the nutritional values. Now we need to create drink types using the fields “Name”, “Milk”, “Whipped Cream” and “Size”. The paste() function is used to create a new field “kids_drink_type” as follows:

kids_drinks$kids_drink_type <- paste(kids_drinks$Name," ",kids_drinks$Milk," ",
                                     kids_drinks$`Whipped Cream`,"-", kids_drinks$Size)
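Note that paste() already separates its arguments with a space by default (sep = " "), so the explicit " " arguments above produce runs of spaces in the labels. A minimal illustration, and a tidier equivalent relying on the default sep:

```r
paste("A", " ", "B")        # "A   B" - default sep plus the literal space argument
paste("A", "B", sep = " ")  # "A B"
```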

For the purpose of segmentation, hierarchical clustering is done on the data set. The data for clustering needs to be in matrix format because heatmaply, which will be used for this purpose, takes a matrix as input. heatmaply produces interactive cluster heat maps using the Plotly and ggplot2 packages in R. A heat map is a graphical method for plotting high-dimensional/multivariate data, in which a table of numbers is encoded as a grid of colored cells.

From the data set, relevant fields are selected using the select() function from dplyr package in R. The fields have different range of values, and their distribution is also unknown. Therefore, normalization is performed on the data set to change the values of numeric columns in the data set to a common scale, without distorting differences in the ranges of values.

The heat map will only use numerical values, therefore row names are set on the data so that the resulting heat map is labelled with the drink types.

kids_starbucks <- dplyr::select(kids_drinks, c(19, 4:15))
kids_starbucks[, 2:13] <- normalize(kids_starbucks[, 2:13])
row.names(kids_starbucks) <- kids_starbucks$kids_drink_type
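The normalize() function here (from heatmaply) rescales each numeric column to the [0, 1] range; conceptually it is per-column min–max scaling, sketched in base R below on toy values (not the exercise data):

```r
# Per-column min-max scaling: (x - min) / (max - min)
minmax <- function(v) (v - min(v)) / (max(v) - min(v))

toy <- data.frame(x = c(0, 5, 10), y = c(2, 4, 6))
as.data.frame(lapply(toy, minmax))
#     x   y
# 1 0.0 0.0
# 2 0.5 0.5
# 3 1.0 1.0
```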

Hierarchical clustering2 is a general family of clustering algorithms that build nested clusters by merging or splitting them successively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample. Agglomerative clustering performs hierarchical clustering using a bottom-up approach: each observation starts in its own cluster, and clusters are successively merged together. The linkage criterion determines the metric used for the merge strategy.

To find the best linkage method for hierarchical clustering, the data set is first converted into a matrix using the data.matrix() function. Then the dist() function is applied to the matrix (excluding the kids drink type field) to compute the Euclidean distance matrix, and the dend_expend() function is used on it. dend_expend3 returns a list with three items. The first, “dends”, is a dendlist with all the possible dendrogram combinations. The second, “dists”, is a list with all the possible distance matrix combinations. The third, “performance”, is a data.frame with three columns: dist_methods, hclust_methods, and optim. optim is calculated (by default) as the cophenetic correlation between the distance matrix and the cophenetic distance of the hclust object.

kids_drinks_matrix <- data.matrix(kids_starbucks)
kdm_clust <- dist(kids_drinks_matrix[,-c(1)], method = "euclidean")
dend_expend(kdm_clust)[[3]]
  dist_methods hclust_methods     optim
1      unknown         ward.D 0.6411955
2      unknown        ward.D2 0.7193447
3      unknown         single 0.6903938
4      unknown       complete 0.7190618
5      unknown        average 0.7999837
6      unknown       mcquitty 0.6929131
7      unknown         median 0.5686416
8      unknown       centroid 0.7667487

We see that the optim value is highest for the “average” linkage, therefore it was used to create the cluster heat map. Before plotting the heat map, let us first get the optimum number of clusters in our data set using the find_k() function. This uses the average silhouette approach. In short, the average silhouette approach measures the quality of a clustering. That is, it determines how well each object lies within its cluster. A high average silhouette width indicates a good clustering. The average silhouette method computes the average silhouette of observations for different values of k. The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k4.
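A hedged sketch of the find_k() step on synthetic data (the exercise applies it to the kids drinks dendrogram; the data and variable names below are illustrative):

```r
library(dendextend)

# Two well-separated synthetic groups, so a small k should be suggested
set.seed(42)
m <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 5), ncol = 2))

dend <- as.dendrogram(hclust(dist(m), method = "average"))

k <- find_k(dend)   # average-silhouette search over candidate values of k
k$nc                # suggested number of clusters
```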

The optimal number of clusters for this data set is found to be 10. Therefore, in the heatmaply() function below, the argument “k_row” is set to 10. The argument “Colv” is set to NA because we are only segmenting the rows (kids drink types). Euclidean distance is used for calculating the cluster distances, and average linkage is used to find the most suitable cluster for each drink type, as identified earlier. The “seriate” argument, from the “seriation” package, is used to find an optimal ordering of rows and columns. Optimal means optimizing the Hamiltonian path length restricted by the dendrogram structure; in other words, the branches are rotated so that the sum of distances between each pair of adjacent leaves (labels) is minimized. Here, the argument is set to “GW” (Gruvaeus and Wainer) as it potentially leads to faster heuristics5.

Plotting the cluster heat map:

heatmaply(kids_drinks_matrix[,-c(1)],
          Colv=NA,
          seriate = "GW",
          dist_method = "euclidean",
          hclust_method = "average",
          k_row = 10,
          fontsize_row = 2.5,
          fontsize_col = 5,
          colors = Reds,
          main="Kids Drinks Types at Starbucks by their nutritional value \n10 clusters are obtained\n",
          xlab = "Kids Drinks Type (Starbucks)",
          ylab = "Nutritional Factors")%>% layout(height=800,width=600)

Now let us build the hierarchical clustering object with average linkage, cut the resulting tree into 10 clusters and save the cluster labels.

kids_drinks_clust <- hclust(kdm_clust, method = "average")
cut_avg <- cutree(kids_drinks_clust, k = 10)
kids_drinks_clusters <- mutate(kids_starbucks, cluster = cut_avg)
kids_drinks_clusters$cluster = as.factor(kids_drinks_clusters$cluster)
paged_table(head(kids_drinks_clusters[,c(1,14)]))
ggparcoord(data = kids_drinks_clusters, 
           columns = c(2:13), 
           groupColumn = 14,
           scale = "uniminmax",
           alphaLines = 0.2,
           boxplot = TRUE, 
           title = "Multiple Parallel Coordinates Plots of Kids Drinks Types at Starbucks by Nutrition Factors") +
  facet_wrap(~cluster) + 
  theme(axis.text.x = element_text(angle = 30, hjust=1))

References:

  1. https://rstudio.github.io/distill/tables.html
  2. https://scikit-learn.org/stable/modules/clustering.html
  3. https://www.rdocumentation.org/packages/dendextend/versions/1.15.1/topics/dend_expend
  4. https://uc-r.github.io/kmeans_clustering
  5. https://isss608-ay2021-22t2.netlify.app/hands-on_ex/hands-on_ex05/hands-on_ex05-heatmap